Using Subtitle Edit to Integrate Faster-Whisper for Local Speech-to-Text

TLDR

  • Faster-Whisper runs on the CTranslate2 inference engine and supports 8-bit quantization, making it roughly 4x faster than the original Whisper while using less memory.
  • Integrating Faster-Whisper-XXL through Subtitle Edit is recommended: it sidesteps the dependency conflicts commonly encountered when setting up a Python environment by hand.
  • Model selection advice: Choose large-v3 for maximum accuracy, or large-v3-turbo for a highly recommended balance of speed and efficiency.
  • Empirical tests show that large-v3-turbo takes only about 16 seconds to transcribe 5 minutes of audio, demonstrating excellent performance.

Technical Advantages of Faster-Whisper

Faster-Whisper is a reimplementation of Whisper built on CTranslate2 (a fast inference engine for Transformer models). Compared to the original OpenAI Whisper, its main advantages are:

  • Faster Speed: Performance is improved by approximately 4 times or more.
  • Lower Memory Usage: Through 8-bit quantization technology, VRAM requirements are significantly reduced.

For users who want to perform speech-to-text (STT) locally without causing system lag, this is currently the superior choice.
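The memory savings come from storing model weights as 8-bit integers instead of 32-bit floats, cutting storage from 4 bytes to 1 byte per weight. A toy sketch of the idea (symmetric per-tensor quantization; CTranslate2's actual implementation is more involved):

```python
def quantize_int8(weights):
    """Map float weights onto the int8 range [-127, 127] with one shared scale."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 values."""
    return [v * scale for v in q]

weights = [0.81, -1.27, 0.05, 0.64, -0.33]
q, scale = quantize_int8(weights)

print(q)                                   # small integers, 1 byte each
print([round(w, 2) for w in dequantize(q, scale)])  # close to the originals
```

The quantized values reproduce the originals to within one scale step, which is why inference quality barely changes while VRAM usage drops sharply.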

Solving Python Dependency Installation Issues

When this issue occurs: users download a standalone build such as Faster-Whisper-XXL and try to run it directly, but it fails because of dependency conflicts with Python audio/video packages.

It is recommended to use Subtitle Edit for integration instead. Subtitle Edit is a powerful subtitle editing software, and its built-in integration features automatically handle environment configuration, making the process relatively stable and smooth.

Subtitle Edit Integration Steps

  1. Open Subtitle Edit and select "Video" -> "Audio to text (Whisper)..." from the menu.
  2. If the system prompts you to download ffmpeg, click confirm.
  3. Select "Purfview's Faster-Whisper-XXL" in the Engine option. If the component is not installed, the system will automatically prompt you to download it.
  4. Download the model from the "Choose model" dropdown menu.

TIP

Model Difference Explanation:

  • Large-v3: Currently the most accurate model with the most parameters, but inference is slower and requires more memory.
  • Large-v3-Turbo: A distilled version of v3 that reduces Decoder Layers from 32 to 4. Parameters are reduced by about 48%, speed is increased by about 8 times, and English recognition accuracy is almost identical to the full version.
  5. Drag the video/audio file into the window and click "Generate" to start recognition. If you are transcribing audio files such as mp3, remember to adjust the file type filter in the open dialog.
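Once recognition finishes, Subtitle Edit can save the result as an .srt file. A minimal sketch of the SRT cue format it writes (hypothetical helper, not Subtitle Edit's own code):

```python
def srt_timestamp(seconds):
    """Format a time in seconds as the SRT HH:MM:SS,mmm timestamp."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def srt_cue(index, start, end, text):
    """Render one numbered SRT cue block."""
    return f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"

print(srt_cue(1, 0.0, 2.5, "Hello, world."))
# 1
# 00:00:00,000 --> 00:00:02,500
# Hello, world.
```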

Performance Analysis

When this is relevant: evaluating the transcription speed and accuracy of different models on local hardware.

Test environment: PNY RTX 4070 Ti Super 16GB Blower, test material is a 5-minute and 16-second mp3 file.

  • Test Results:
    • Using large-v3-turbo: approx. 16 seconds.
    • Using large-v3: approx. 32 seconds.

The numbers show that even the full large-v3 model now runs far faster than older local tools. Although large-v3's accuracy gains are limited on difficult material such as songs, its throughput is more than enough for everyday local speech-to-text needs.
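Converting the measurements above into a real-time factor makes the gap concrete (the test clip is 5 min 16 s = 316 s of audio):

```python
audio_seconds = 5 * 60 + 16  # 316 s of audio in the test clip

for model, elapsed in [("large-v3-turbo", 16), ("large-v3", 32)]:
    rtf = audio_seconds / elapsed  # seconds of audio processed per second of compute
    print(f"{model}: {rtf:.1f}x faster than real time")
# large-v3-turbo: 19.8x faster than real time
# large-v3: 9.9x faster than real time
```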

Changelog

  • 2026-01-30 Initial document creation.